Wavelet subband-specific learning for low-dose computed tomography denoising

Deep neural networks have shown great improvements in low-dose computed tomography (CT) denoising. Early algorithms were primarily optimized to obtain an accurate image with low distortion between the denoised image and the reference full-dose image, at the cost of yielding an overly smoothed, unrealistic CT image. Recent research has sought to preserve the fine details of denoised images with high perceptual quality, which has been accompanied by a decrease in objective quality due to a trade-off between perceptual quality and distortion. We pursue a single network that can generate accurate and realistic CT images with high objective and perceptual quality, achieving a better perception-distortion trade-off. To achieve this goal, we propose a stationary wavelet transform-assisted network that exploits the characteristics of the high- and low-frequency domains of the wavelet transform, together with frequency subband-specific losses defined in the wavelet domain. We first introduce a stationary wavelet transform into the network training procedure. Then, we train the network using objective loss functions defined for the high- and low-frequency domains to enhance the objective quality of the denoised CT image. With this network design, we train the network again after replacing the objective loss functions with perceptual loss functions in the high- and low-frequency domains. With this strategy, we obtained denoised CT images with high perceptual quality while minimizing the loss of objective quality. We evaluated our algorithms on phantom and clinical images, and the quantitative and qualitative results indicate that our method outperforms the existing state-of-the-art algorithms in terms of objective and perceptual quality.


Introduction
X-ray computed tomography (CT) is widely used in many industries and is an essential clinical diagnostic tool. Moreover, it provides a noninvasive method of obtaining clinical information from patients. However, high radiation exposure is a concern in the use of CT. According to US statistics, the increased use of CT scans contributes to the potential risk of lung cancer [1]. Thus, a CT scan must be performed under the principle of as low as reasonably achievable [2].

PLOS ONE | https://doi.org/10.1371/journal.pone.0274308 | September 9, 2022
Therefore, low-dose CT (LDCT) has been increasingly adopted. However, reducing the CT radiation dose produces more noise in the CT scans; thus, LDCT denoising has been widely studied in the medical imaging field. In recent years, deep learning algorithms using a convolutional neural network (CNN) have demonstrated excellent performance compared to traditional machine learning algorithms in the computer vision community. This trend has also occurred in CT research, and LDCT denoising has benefited considerably from CNN denoising. An encoder-decoder CNN designed with a residual connection [3] was shown to remove noise effectively on the Mayo Clinic data. Yang et al. [4] used 2D and 3D CNNs with residual networks. Kang et al. [5] provided an iterative framelet-based denoising algorithm.
Although these methods demonstrated successful results with high objective quality, the pixel-level loss based on the mean squared error (MSE) or mean absolute error (MAE) generated overly smoothed images with a significant loss in detailed texture and edges, which is not beneficial from the perspective of human visual perception [6]. Thus, after the early deep learning development of LDCT denoising, the current LDCT denoising goal has moved toward pursuing high perceptual quality to recover the details in denoised images.
VGG loss [7] and the generative adversarial network (GAN) [8] are commonly adopted when pursuing high perceptual quality in LDCT denoising. Badretale et al. [9] defined the loss function using the perceptual loss generated from the Visual Geometry Group (VGG) network [10] to better catch the details of texture and preserve the edges. Wolterink et al. [11] used the GAN for CT denoising, and Yi et al. [12] combined the conditional generative network and sharpness detection network to prevent blurring while denoising. The Wasserstein distance and perceptual loss were used through the GAN by Yang et al. [13]. You et al. [14] employed 3D volumetric information and perceptual loss with the GAN. Shan et al. [15] found that the 3D CNN has better results than the 2D CNN, and it can be trained by transfer-learning from the 2D trained network with GAN loss. Choi et al. [16] used statistical information and Li et al. [17] employed 3D self-attention to retrieve the denoised image with GAN loss.
To clarify the terminology in this paper, a low-distortion image in the LDCT image denoising task is an image with high objective quality, that is, a high peak signal-to-noise ratio (PSNR). In addition, an image that preserves fine details or sharp edges that may provide important information for clinical diagnosis is referred to as a high perceptual image. High objective quality can be obtained when a network is trained with a pixel-wise loss (MSE or MAE), which we call the objective loss ($L_o$). High perceptual quality can be achieved by optimizing the VGG loss ($L_{vgg}$) or the adversarial loss ($L_{adv}$). If at least one of these two losses is used to train a network, we say that the network optimizes the perceptual loss ($L_p$).
When an algorithm is trained based on perceptual loss, the resulting decrease in PSNR of the image can be explained as the perception-distortion (PD) trade-off, and a PD bound exists for image denoising algorithms [18]. Due to the PD trade-off phenomenon in the LDCT image denoising task, if we seek an image with a high PSNR, we obtain a blurred image that is inappropriate for clinical diagnostic use. In contrast, if we aim to achieve an image with high perception, we must be aware that the image noise increases and PSNR decreases.
To illustrate the PD trade-off with a visual example of LDCT, we optimized two representative image enhancement networks, U-Net [19] and EDSR [20], based only on either objective or perceptual loss. We compared their resulting images as depicted in Fig 1. The PSNR values corresponding to the denoised output images are summarized in Table 1. For EDSR, we added a global skip connection for LDCT denoising.
https://doi.org/10.1371/journal.pone.0274308.g001
Table 1. Change in the peak signal-to-noise ratio (PSNR) value when U-Net and EDSR are optimized for perceptual loss ($L_p$) or objective loss ($L_o$). The networks had a lower PSNR value when optimized for perceptual loss than when optimized for objective loss.
Although recent LDCT denoising studies by Yang et al. [13] and Shan et al. [15] have focused more on perceptual quality than on the loss of objective quality, our algorithm prioritizes objective quality as highly as perceptual quality. This is expected to improve the PD bound, because an image with low objective quality inherently contains more noise. Fig 2 illustrates that IrCNN [21], which has a lower PSNR than EDSR [20], reduces noise poorly, making it harder to see structural details. EDSR exhibited better noise reduction with a higher PSNR (i.e., higher objective quality) and better visibility than IrCNN, although it produced a blurred image. This example shows that high objective quality also has advantages in denoising algorithms; thus, sacrificing objective quality in pursuit of perceptual quality does not always yield better results.

When we compare two identical networks, one maximizing objective quality and the other maximizing perceptual quality, the latter cannot exceed the objective quality of the former network. In other words, the maximum objective quality is determined by the network capability itself without considering any operation for perceptual quality. Thus, to obtain high objective and perceptual quality, we must first design a network that exhibits the best performance in objective quality when optimizing only objective loss. Then, with this network, we aim to secure the perceptual quality as much as possible while minimizing the loss of the objective quality.
Supervised learning-based LDCT denoising techniques have difficulty obtaining perfectly paired low- and normal-dose images. Recently, unsupervised and semi-supervised learning algorithms have been developed for LDCT denoising to eliminate the need for high-quality reference images for training. Kim et al. [22] and Yuan et al. [23] provided a practical way to train the networks by generating both the training input and a realistic label from existing data with the help of a physics-based CT noise model. Tang et al. [24] adopted CycleGAN [25] to train on an unpaired dataset for LDCT denoising. Although these methods have shown promising results, their performance is still inferior to that of supervised learning [26], and they have difficulty preserving fine anatomical details due to their simple noise model [27].
In this paper, we propose a novel LDCT denoising strategy based on the wavelet transform to enhance both objective and perceptual quality. The wavelet transform has been used in several studies on deep learning-based image denoising [5,28,29]. However, none of these studies has taken full advantage of the wavelet properties for better objective and perceptual quality. Previous studies have used the wavelet transform only as an input or as an operation within a layer and have not exploited its frequency properties.
The wavelet transform can decompose a signal into high- and low-frequency subbands, each with its own properties. The low-frequency subband is responsible for the overall objective quality, whereas the high-frequency subbands are very sensitive to small changes in fine details and substantially influence the perceptual quality. We exploited the characteristics of the high- and low-frequency subbands of the wavelet transform and defined the losses in the wavelet domain. With this wavelet domain loss, we minimized the loss of objective quality when seeking perceptual quality in one network.
The main contributions of this paper are as follows:
• A stationary wavelet transform-assisted network is proposed to perform the LDCT image denoising task using newly defined wavelet losses in the low- and high-frequency wavelet subbands. The network achieved the highest objective quality in LDCT denoising compared to the current state-of-the-art denoising algorithms for natural RGB and LDCT images.
• We also propose a novel wavelet subband-specific learning strategy that allows our method to recover high perceptual quality with less sacrifice of objective quality. Our method achieved competitive perceptual quality with the highest objective quality compared to the current state-of-the-art LDCT denoising methods.
• Our extensive experiments on real datasets (in vivo and phantom data) reveal that the proposed methods convincingly improve the denoising performance with a better PD trade-off over the existing state-of-the-art algorithms.
https://doi.org/10.1371/journal.pone.0274308.g002

Overall architecture
Fig 3 displays the overall architecture of the proposed method. The network is based on EDSR [20], but we modified the network by adding a global skip connection for denoising. After building the base network, we adopted a stationary wavelet transform (SWT) to enhance the objective quality further. We applied a level 2 SWT to decompose the input noisy LDCT images into seven different frequency subbands. Then, we normalized each subband and used them as input to the network. We defined the wavelet loss to optimize the objective loss with the network output in the wavelet domain and secured the maximum objective quality first with our proposed network. Finally, to obtain denoised images with high perceptual quality, we redefined the loss function in the wavelet domain by introducing the perceptual losses, that is, the VGG and adversarial losses. The purpose of the redefined loss function in the wavelet domain was to increase the perceptual quality while maintaining the objective quality maximized in the previous step. To achieve this goal, we assigned different weights to each loss in the high- and low-frequency subbands to exploit the characteristics of the high- and low-frequency image components. In the low-frequency domain loss, we assigned more weight to the objective loss term to increase the objective quality, whereas in the high-frequency domain loss, we assigned more weight to the perceptual loss terms to enhance the perceptual quality. The relevant details are described in the following subsections.
Fig 3. Overall architecture of the proposed methods, where $G$ is a generator (denoiser). $D_{high}$ and $D_{low}$ are discriminators in the high- and low-frequency domains, the VGG loss uses VGG19 [10], and the $L_1$ loss is the mean absolute error (MAE); each calculates the loss between its two inputs, the denoised and ground-truth clean images. The variables $x$, $\tilde{y}$, and $y$ denote the noisy input CT image, denoised CT output image, and clean CT image. Further, $wx$, $w\tilde{y}$, and $wy$ denote $x$, $\tilde{y}$, and $y$ in the SWT domain with normalization in each subband. Within $w\tilde{y}$ and $wy$, $w\tilde{y}_{low}$ and $wy_{low}$ are the low-frequency subband, and $w\tilde{y}_{high}$ and $wy_{high}$ include the high-frequency subbands. https://doi.org/10.1371/journal.pone.0274308.g003

Generator and discriminator
The generator and discriminator networks are depicted in Fig 4. The generator, used as a denoiser, consists of one convolution for feature extraction, 32 residual blocks, and a final convolution block for image reconstruction. Each residual block applies a convolution, a ReLU, and a convolution sequentially. Each convolution has a 3 × 3 kernel, a stride of 1, a padding of 1, and 96 channels. The global skip connection is used to learn the residual, as in DnCNN [30]. We have two discriminators, one each for the high- and low-frequency domains. The network architecture of the discriminators is based on PatchGAN [31].
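As a rough check on the generator description, the following sketch counts the convolution layers and estimates the receptive field implied by the stated 3 × 3, stride 1, padding 1 configuration. It assumes that the final reconstruction block contains a single convolution, which the text does not state explicitly; the function names are illustrative only.

```python
def conv_out_size(n, kernel=3, stride=1, padding=1):
    """Spatial output size of one convolution for an input of size n."""
    return (n + 2 * padding - kernel) // stride + 1


def generator_conv_layers(num_res_blocks=32, convs_per_block=2):
    """Feature-extraction conv + residual-block convs + one final conv.

    Assumption: the final reconstruction block is a single convolution.
    """
    return 1 + num_res_blocks * convs_per_block + 1


def receptive_field(num_layers, kernel=3):
    """Receptive field of stacked stride-1 convolutions (no dilation)."""
    return 1 + num_layers * (kernel - 1)


if __name__ == "__main__":
    layers = generator_conv_layers()
    print("patch size after one conv:", conv_out_size(80))
    print("convolution layers:", layers)
    print("receptive field (pixels):", receptive_field(layers))
```

Under these assumptions, an 80 × 80 training patch keeps its size through every layer, and the stack of stride-1 convolutions alone already covers a region larger than the patch.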

Stationary wavelet transform and subband analysis
A wavelet transform decomposes a signal into a set of basis functions consisting of contractions, expansions, and translations of a mother function, called the wavelet, enabling multiresolution image analysis [32]. The classical discrete wavelet transform (DWT) usually decomposes the original image into a sequence of new images with decreased size, and the SWT decomposes a signal into new images with the same size as the original image. Both the DWT and SWT have the advantage of expanding their receptive fields because of downsampling in the DWT or upsampling the convolutional filter in the SWT.
The SWT overcomes the drawback of the DWT, which is not shift-invariant. Moreover, using the SWT enables us to build better networks that achieve higher objective quality performance than the U-Net or encoder-decoder architecture with the DWT adopted [28]. Thus, although previous wavelet-based image denoising studies used the DWT [28,29], we used the SWT, considering the relative advantages of the SWT.
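The SWT is implemented using the filter-bank algorithm, which is depicted in Fig 5. We used the Haar function as our wavelet function. Let $h$ and $g$ be the scaling and wavelet filters, respectively. Then, the scaling and wavelet filters of the SWT at scale $j + 1$ are defined recursively by upsampling the filters at scale $j$ by a factor of two:
$$h^{(j+1)}[k] = (h^{(j)} \uparrow 2)[k], \qquad g^{(j+1)}[k] = (g^{(j)} \uparrow 2)[k],$$
where $(f \uparrow 2)[k]$ equals $f[k/2]$ for even $k$ and $0$ for odd $k$. The $J$th-level SWT of an image $x$ is then calculated recursively with the filter-bank operations.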
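The filter-bank recursion can be illustrated with a minimal, dependency-free sketch of a two-level Haar SWT. This is not the authors' implementation: it assumes periodic boundary handling, realizes the upsampled filters as a tap spacing of 2^(j-1) at level j, and uses one possible LH/HL naming convention. All function names are hypothetical.

```python
import math


def haar_filter_1d(signal, dilation):
    """Undecimated Haar analysis of a 1D list with periodic extension.

    A tap spacing of `dilation` = 2**(j-1) at level j is equivalent to
    upsampling the Haar filters by inserting zeros between coefficients.
    """
    n = len(signal)
    s = 1.0 / math.sqrt(2.0)
    low = [s * (signal[i] + signal[(i + dilation) % n]) for i in range(n)]
    high = [s * (signal[i] - signal[(i + dilation) % n]) for i in range(n)]
    return low, high


def transpose(image):
    return [list(col) for col in zip(*image)]


def filter_rows(image, dilation):
    pairs = [haar_filter_1d(row, dilation) for row in image]
    return [p[0] for p in pairs], [p[1] for p in pairs]


def filter_cols(image, dilation):
    low, high = filter_rows(transpose(image), dilation)
    return transpose(low), transpose(high)


def swt2_haar(image, levels=2):
    """Return a dict of subbands; every subband keeps the input size."""
    subbands = {}
    approx = image
    for j in range(1, levels + 1):
        d = 2 ** (j - 1)
        row_low, row_high = filter_rows(approx, d)
        ll, lh = filter_cols(row_low, d)    # low-pass along rows
        hl, hh = filter_cols(row_high, d)   # high-pass along rows
        subbands[f"LH{j}"] = lh
        subbands[f"HL{j}"] = hl
        subbands[f"HH{j}"] = hh
        approx = ll                         # recurse on the approximation
    subbands[f"LL{levels}"] = approx
    return subbands


if __name__ == "__main__":
    image = [[float((r * 8 + c) % 5) for c in range(8)] for r in range(8)]
    bands = swt2_haar(image, levels=2)
    print(sorted(bands))  # seven subbands for a level 2 transform
```

Because no downsampling takes place, a level 2 transform yields seven subbands (LL2 plus three detail subbands per level), each with the same size as the input, matching the seven-channel network input described in the text.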
Since the low-frequency subband LL 2 contains most of the energy (i.e., the overall shape) of the original image as depicted in Fig 6, it plays a more dominant role than other high-frequency subbands in determining the objective image quality. However, high-frequency subbands are as important as low-frequency subbands because the high-frequency subbands present textural details and differences in the objective quality of the state-of-the-art algorithms are very small. Thus, we managed high-frequency subbands carefully by determining the best combination of the weights for each frequency subband to maximize the objective quality.
Comparing the histograms of low-dose and normal-dose images in Fig 6 reveals that their distributions are similar in the low-frequency subband but different in the high-frequency subbands, which implies that we should manage the high-frequency subbands more precisely. When we minimize objective loss functions, we can increase the objective quality. However, this optimization process tends to make the distributions of the high-frequency subbands centered on 0. As a result, the network that optimizes the objective loss function yields an overly smoothed denoised image with lost detailed information. Given this observation, we must alleviate this zero-centered distribution with newly defined perceptual losses. In our approach, we divided the loss functions into high- and low-frequency domains and defined each of them according to the frequency characteristics. We focused more on objective quality in the low-frequency domain and tried to enhance perceptual quality in the high-frequency domain.
Fig 6. For better visibility, we included only the $LL_2$, $LH_2$, $HL_2$, and $HH_2$ subbands. The four images correspond to histograms for the low-frequency ($LL_2$) and high-frequency ($LH_2$, $HL_2$, and $HH_2$) subbands. https://doi.org/10.1371/journal.pone.0274308.g006

Frequency subband-specific loss on the wavelet transform domain
In order to maximize the objective and perceptual quality, we took a strategy to secure high objective quality first and minimize the loss of objective quality while pursuing perceptual quality, and we defined the necessary loss in the wavelet transform domain as follows.
The variables $x$, $\tilde{y}$, and $y$ denote the noisy input CT image, denoised CT output image, and clean CT image, respectively. Further, $wx$, $w\tilde{y}$, and $wy$ denote $x$, $\tilde{y}$, and $y$ in the SWT domain with normalization in each subband. Moreover, $G$ is a generator that also serves as the denoiser. Thus, we can formulate the following:
$$w\tilde{y} = G(wx).$$
To accomplish high objective quality, we first define the objective loss in the low- and high-frequency domains of the wavelet transform as follows:
$$L_{lo} = \frac{1}{wh}\lVert w\tilde{y}_{low} - wy_{low}\rVert_1, \qquad L_{ho} = \frac{1}{wh}\lVert w\tilde{y}_{high} - wy_{high}\rVert_1,$$
where $w$ and $h$ are the width and height of the image, $low$ is the one-channel subband $LL_2$, and $high$ includes the $LH_2$, $HL_2$, $HH_2$, $LH_1$, $HL_1$, and $HH_1$ subbands. Then, we define the total objective loss by combining $L_{lo}$ and $L_{ho}$:
$$L_{wo} = \alpha_{low} L_{lo} + (1 - \alpha_{low}) L_{ho} \tag{10}$$
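where $\alpha_{low}$ is a hyperparameter that controls the weight of the low-frequency subband. With a proper value, optimizing $L_{wo}$ with our proposed network achieves the best performance in objective quality compared to the existing algorithms. We defined the VGG loss to pursue perceptual quality as follows:
$$L_{lvgg} = \frac{1}{w_f h_f}\lVert GM(VGG(w\tilde{y}_{low})) - GM(VGG(wy_{low}))\rVert_F^2, \qquad L_{hvgg} = \frac{1}{w_f h_f}\lVert GM(VGG(w\tilde{y}_{high})) - GM(VGG(wy_{high}))\rVert_F^2.$$
The VGG operation used VGG-19 [10] to extract the feature maps at the second convolutional layer after the second maxpool operation, called ReLU2_2. We applied the Gram matrix, denoted as $GM$, to the feature map from VGG-19. In addition, $w_f$ and $h_f$ are the width and height of the feature map after the Gram matrix output. Adversarial loss is defined in both the low- and high-frequency domains:
$$L_{ladv} = \mathbb{E}\left[\log\left(1 - D_{low}(w\tilde{y}_{low})\right)\right], \qquad L_{hadv} = \mathbb{E}\left[\log\left(1 - D_{high}(w\tilde{y}_{high})\right)\right],$$
where $D_{high}$ and $D_{low}$ are the discriminators for the high- and low-frequency domains, respectively. By introducing adversarial loss, we can define the GAN [8] to optimize the following:
$$\min_G \max_{D_{low}} \; \mathbb{E}\left[\log D_{low}(wy_{low})\right] + \mathbb{E}\left[\log\left(1 - D_{low}(w\tilde{y}_{low})\right)\right],$$
$$\min_G \max_{D_{high}} \; \mathbb{E}\left[\log D_{high}(wy_{high})\right] + \mathbb{E}\left[\log\left(1 - D_{high}(w\tilde{y}_{high})\right)\right].$$
We combined all losses in the high- and low-frequency domains with the objective loss in the wavelet domain for perceptual quality:
$$L_{lp} = \alpha_{lo} L_{lo} + \alpha_{lvgg} L_{lvgg} + \alpha_{lGAN} L_{ladv}, \qquad L_{hp} = \alpha_{ho} L_{ho} + \alpha_{hvgg} L_{hvgg} + \alpha_{hGAN} L_{hadv}.$$
Then, the total loss is redefined in the same way as $L_{wo}$:
$$L_{wp} = \alpha_{low} L_{lp} + (1 - \alpha_{low}) L_{hp} \tag{19}$$
where $\alpha_{low}$ takes the same value as in (10). Here, $\alpha_{ho}$, $\alpha_{hvgg}$, and $\alpha_{hGAN}$ are the high-frequency objective weight, VGG loss weight, and GAN loss weight, and $\alpha_{lo}$, $\alpha_{lvgg}$, and $\alpha_{lGAN}$ are the low-frequency objective weight, VGG loss weight, and GAN loss weight, respectively, in the wavelet domain loss. These hyperparameters control the importance of each loss. In the low-frequency domain, we set $\alpha_{lo}$, $\alpha_{lvgg}$, and $\alpha_{lGAN}$ to 1.0, 0.1, and 0.0001, respectively. We assigned a high weight to the objective loss and less weight to the perceptual loss terms in the low-frequency domain. Moreover, $\alpha_{lvgg}$ is set to 0.1 because the low-frequency domain also contains textures and edges.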
In the high-frequency domain, we set $\alpha_{ho}$, $\alpha_{hvgg}$, and $\alpha_{hGAN}$ to 0, 1.0, and 0.01, respectively. We did not add objective loss in the high-frequency domain and assigned higher weights to the VGG and GAN losses than in the low-frequency domain. Thus, we maintained the detailed information of the denoised image while minimizing the loss of objective quality. We evaluated the performance of our proposed model optimized for both the $L_{wo}$ and $L_{wp}$ loss functions.
Loss functions for objective and perceptual losses, which are commonly adopted for other algorithms, are defined similarly to $L_{wo}$ and $L_{wp}$ but without the SWT operation:
$$L_o = \frac{1}{wh}\lVert \tilde{y} - y\rVert_1, \qquad L_p = L_o + \alpha_{vgg} L_{vgg} + \alpha_{gan} L_{adv}.$$
We used these loss functions for the networks based on U-Net and EDSR, with $\alpha_{vgg}$ and $\alpha_{gan}$ set to 1.0 and 0.001, respectively, to compare results.
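To make the weighting scheme concrete, the sketch below combines already-computed loss terms using the weights quoted in the text (low-frequency: 1.0, 0.1, 0.0001; high-frequency: 0, 1.0, 0.01). It assumes the low- and high-frequency totals are mixed with $\alpha_{low}$ and $1 - \alpha_{low}$, with $\alpha_{low} = 0.2$ as reported in the experiments. The function and dictionary names are hypothetical, and the numeric loss values in the usage example are placeholders rather than measured results.

```python
# Weights quoted in the text for each frequency domain.
LOW_WEIGHTS = {"obj": 1.0, "vgg": 0.1, "gan": 0.0001}
HIGH_WEIGHTS = {"obj": 0.0, "vgg": 1.0, "gan": 0.01}


def wavelet_objective_loss(l_lo, l_ho, alpha_low=0.2):
    """Total objective loss: alpha_low * L_lo + (1 - alpha_low) * L_ho."""
    return alpha_low * l_lo + (1.0 - alpha_low) * l_ho


def band_perceptual_loss(terms, weights):
    """Weighted sum of the objective, VGG, and GAN terms of one band."""
    return sum(weights[name] * terms[name] for name in weights)


def wavelet_perceptual_loss(low_terms, high_terms, alpha_low=0.2):
    """Total perceptual loss, mixed in the same way as the objective loss."""
    l_lp = band_perceptual_loss(low_terms, LOW_WEIGHTS)
    l_hp = band_perceptual_loss(high_terms, HIGH_WEIGHTS)
    return alpha_low * l_lp + (1.0 - alpha_low) * l_hp


if __name__ == "__main__":
    # Placeholder loss values, for illustration only.
    low = {"obj": 0.02, "vgg": 0.5, "gan": 0.7}
    high = {"obj": 0.03, "vgg": 0.4, "gan": 0.6}
    print(wavelet_objective_loss(low["obj"], high["obj"]))
    print(wavelet_perceptual_loss(low, high))
```

With a high-frequency objective weight of 0, the pixel-wise error in the detail subbands does not influence the perceptual training stage at all; only the VGG and adversarial terms shape those subbands.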

Normalization of wavelet subbands
We normalized each wavelet subband using its mean and standard deviation, which we updated in the same way as the mean and standard deviation are updated in batch normalization [33]. First, we calculated the batch mean ($\mu_B$) and standard deviation ($\sigma_B$) as follows:
$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_B = \sqrt{\frac{1}{m}\sum_{i=1}^{m}(x_i - \mu_B)^2},$$
where $m$ is the number of training samples in a mini-batch. Then, we updated the mean and variance for normalization:
$$E[x] \leftarrow E_B[\mu_B], \qquad Var[x] \leftarrow \frac{m}{m-1} E_B[\sigma_B^2].$$
We updated the mean and variance for 10,000 iterations and froze them afterward. The role of normalizing the wavelet subbands is to optimize every subband equally by balancing the weights of the subbands. As depicted in Fig 6, the low-frequency ($LL_2$) subband has relatively large coefficients; thus, the loss in the low-frequency subband is relatively high compared to the high-frequency losses. By normalizing each subband, we can readjust the subband coefficients toward the normal distribution and evenly assign the weights of each subband in the wavelet domain loss. Without normalization, the low-frequency subband, which has large energy, takes priority in the optimization over the high-frequency subbands; as a result, the objective quality decreased slightly.
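The update-then-freeze behaviour of the subband statistics can be sketched as follows. This is an illustrative simplification, not the authors' code: it tracks a single subband, treats each coefficient passed in as one sample, averages batch statistics in the style of batch normalization inference, and adds a small `eps` for numerical stability. The class name and its parameters are hypothetical.

```python
import math


class SubbandNormalizer:
    """Running mean/variance for one wavelet subband, frozen after N updates."""

    def __init__(self, freeze_after=10_000):
        self.freeze_after = freeze_after
        self.steps = 0
        self.mean_sum = 0.0
        self.var_sum = 0.0
        self.batch_size = None

    def update(self, batch):
        """Accumulate batch statistics; ignored once the statistics are frozen."""
        if self.steps >= self.freeze_after:
            return
        m = len(batch)
        mu = sum(batch) / m
        var = sum((v - mu) ** 2 for v in batch) / m
        self.mean_sum += mu
        self.var_sum += var
        self.batch_size = m
        self.steps += 1

    @property
    def mean(self):
        if self.steps == 0:
            raise ValueError("no statistics accumulated yet")
        return self.mean_sum / self.steps

    @property
    def var(self):
        # Unbiased correction m / (m - 1) applied to the averaged batch variance.
        m = self.batch_size
        return (m / (m - 1)) * (self.var_sum / self.steps)

    def normalize(self, value, eps=1e-5):
        return (value - self.mean) / math.sqrt(self.var + eps)


if __name__ == "__main__":
    norm = SubbandNormalizer(freeze_after=2)
    norm.update([1.0, 2.0, 3.0])
    norm.update([3.0, 4.0, 5.0])
    norm.update([100.0, 200.0, 300.0])  # ignored: statistics are frozen
    print(norm.mean, norm.var, norm.normalize(4.0))
```

In a full pipeline, one such set of statistics would be kept for each of the seven subbands, so that the large-magnitude low-frequency coefficients and the small high-frequency coefficients reach the network on a comparable scale.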

Experimental Setup
In vivo and phantom data acquisition.
We scanned anthropomorphic phantoms of the chest, neck, and pelvis and a Catphan 500 phantom [34] on two different multislice CT scanners (Siemens SOMATOM Sensation Open and Toshiba Aquilion TSX-201A). Table 2 lists the acquisition protocols used to obtain the images for the phantom datasets. A fixed tube voltage (120 kV) was used for all images. After acquiring a normal-dose image using the routine clinical CT acquisition protocol for each organ, a low-dose image pair was [...] [35]. These clinical data were obtained after approval by the institutional review board of Mayo Clinic. The library was HIPAA compliant and built with a waiver of informed consent. The Mayo Clinic dataset contains anonymized CT images of ten patients in total. Each patient record contains normal-dose abdominal CT images and quarter-dose CT images. The dataset includes slice thicknesses of 1 mm and 3 mm, and we used both for training and testing. We first chose the CT slices of three patients for the test dataset; these slices include a larger number of small lesions and can be regarded as a clinically difficult diagnostic task. The CT slices of the other seven patients were used for training and validation, randomly split with a ratio of 0.95:0.05.

Experimental setting.
For all the experiments, we used the Adam solver [36]. All networks were trained with a learning rate of 0.0002. We scheduled the learning rate to halve when the minimum loss did not change after five iterations. All images were normalized between 0 and 1 and used as input for the proposed method. Data augmentation was performed on the training images, including random rotations of 90°, 180°, and 270° and horizontal flipping. In each training batch, a random patch with a size of 80 × 80 was extracted as input. The networks were implemented in the PyTorch framework and trained with four Nvidia Tesla V100 graphics processing units. When we implemented the existing state-of-the-art algorithms to compare the performance, we used the same settings except that images were normalized between -0.5 and 0.5.
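The training setup above can be sketched in plain Python as follows. This is an illustration under assumptions, not the authors' PyTorch code: the same crop, rotation, and flip are applied to the noisy and clean images of a pair, and "the minimum loss does not change after five iterations" is interpreted as five consecutive checks without a new minimum. The names `augment_pair` and `HalveOnPlateau` are hypothetical.

```python
import random


def rot90(image):
    """Rotate a 2D list by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def hflip(image):
    """Flip a 2D list horizontally."""
    return [row[::-1] for row in image]


def augment_pair(noisy, clean, rng, patch_size=80):
    """Apply one random crop, rotation (0/90/180/270), and flip to both images."""
    top = rng.randrange(len(noisy) - patch_size + 1)
    left = rng.randrange(len(noisy[0]) - patch_size + 1)
    quarter_turns = rng.randrange(4)
    flip = rng.random() < 0.5
    out = []
    for image in (noisy, clean):
        patch = [row[left:left + patch_size] for row in image[top:top + patch_size]]
        for _ in range(quarter_turns):
            patch = rot90(patch)
        if flip:
            patch = hflip(patch)
        out.append(patch)
    return out[0], out[1]


class HalveOnPlateau:
    """Halve the learning rate when the best loss has not improved for `patience` checks."""

    def __init__(self, lr=2e-4, patience=5):
        self.lr = lr
        self.patience = patience
        self.best = float("inf")
        self.stale = 0

    def step(self, loss):
        if loss < self.best:
            self.best = loss
            self.stale = 0
        else:
            self.stale += 1
            if self.stale >= self.patience:
                self.lr *= 0.5
                self.stale = 0
        return self.lr


if __name__ == "__main__":
    rng = random.Random(0)
    image = [[r * 100 + c for c in range(100)] for r in range(100)]
    a, b = augment_pair(image, image, rng)
    print(len(a), len(a[0]), a == b)
```

Applying identical random choices to both images of a pair keeps the noisy input and its clean target spatially aligned, which a pixel-wise loss requires.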

Effectiveness of designing the network.
Our proposed network design was modified from EDSR, and its performance was improved through the following structural modifications. A global skip connection allows the network to learn the residual, which was introduced in the DnCNN [30], thus enabling the network to learn the image noise. Therefore, the global skip connection has been adopted in denoising algorithms in [28,37]. Then, the network performance was enhanced by replacing the CT image input with the SWT seven-channel subbands. Finally, we adopted normalization to increase the network performance.
The increase in PSNR obtained by gradually adding each feature is summarized in Table 3, demonstrating that this design strategy effectively achieved higher objective quality.

Weight of a low-frequency subband in wavelet domain loss.
We evaluated the objective image quality by varying the low-frequency weight $\alpha_{low}$ in (10) from 0.2 to 0.8 in steps of 0.15, and the resulting PSNR values are reported in Table 4. We did not include $\alpha_{low}$ values of 0 and 1.0 because the resulting PSNR values are very low, 8.403 and 28.401, respectively. As indicated in Table 4, assigning a low-frequency weight of 0.2 and a high-frequency weight of 0.8 ($1 - \alpha_{low}$), which means that we weight the high-frequency domain loss more than the low-frequency domain loss, yields the best objective quality with the proposed network. We applied this $\alpha_{low}$ (0.2) to (19) and trained the network to secure a high PSNR while pursuing perceptual quality.

Histogram distribution results.
We tested how the histogram distribution changes using the proposed algorithms in Fig 7. When we optimized the objective losses with L wo , the histogram distribution is zero-centered in high-frequency subbands; thus, the denoised images have less detailed information. However, in terms of objective quality, the zero-centered distribution is the average subband value, which coincides with the fact that the denoised image from the denoising network is the average output of all plausible images [38].
In contrast, when we optimized the perceptual loss with L wp , the histogram distribution of the denoised images in the high-frequency subbands demonstrates more comparable matching between normal-dose and denoised CT images. Thus, the denoised CT images have richer textures and patterns, which look like the ground truth CT images, whereas it might entail the loss of objective image quality.

Objective quality results.
We compared the proposed method with the existing state-of-the-art image denoising algorithms for natural RGB and LDCT images that maximize the objective quality, using the Mayo Clinic dataset, to validate the effectiveness of the network optimized with $L_{wo}$. In natural RGB image denoising, it is still common to maximize only the objective quality, so we compared our algorithm optimized with $L_{wo}$ to these image denoising algorithms. For a fair comparison, all algorithms were trained on the Mayo Clinic dataset. As summarized in Table 5, our proposed network optimized with $L_{wo}$ performed the best in terms of PSNR compared with the other existing state-of-the-art denoising algorithms for natural RGB images and LDCT images.
In addition, we analyzed whether our proposed network maintains a better objective quality in the process of optimizing both objective (L wo ) and perceptual loss (L wp ) compared with the following existing LDCT denoising algorithms: U-Net [19], RED-CNN [3], WavResNet [5], WGAN-VGG [13], and CPCE3D [15]. The original U-Net is for segmentation tasks, but with a slight modification, it has also been widely used in denoising tasks [6]. When implementing WavResNet, instead of using the contourlet transform, we used the SWT with the Haar function. While implementing WGAN-VGG, U-Net replaced the original generator because it makes the network more stable when optimizing the network.
We can divide these LDCT denoising algorithms into two groups: one group with U-Net [19], RED-CNN [3], WavResNet [5], and our network with optimization L wo , and the other group with WGAN-VGG [13], CPCE3D [15], and our network with optimization L wp . The former group maximized the objective quality by optimizing the objective loss, and the latter group pursued the perceptual quality by optimizing the perceptual loss. As presented in Table 6, our two proposed networks (optimizing L wo and L wp ) have a higher objective quality than others in each group in both the PSNR and structural similarity index measure (SSIM) metrics.
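For reference, the PSNR used as the objective metric can be computed as in the following sketch. It assumes images normalized to the range [0, 1], so the peak value `data_range` is 1.0; the data range actually used for the reported tables is not stated in this section. The function name is illustrative.

```python
import math


def psnr(denoised, reference, data_range=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2D lists."""
    squared_errors = [
        (d - r) ** 2
        for d_row, r_row in zip(denoised, reference)
        for d, r in zip(d_row, r_row)
    ]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(data_range ** 2 / mse)


if __name__ == "__main__":
    reference = [[0.0, 0.0], [0.0, 0.0]]
    denoised = [[0.1, 0.1], [0.1, 0.1]]
    print(round(psnr(denoised, reference), 3))  # uniform error of 0.1 gives 20 dB
```

Because PSNR depends only on the mean squared error, an overly smoothed output can score well on it while still losing texture, which is why the perceptual comparison is reported separately.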

Perceptual quality results.
To qualitatively compare the perceptual image quality, we selected several CT images. The window levels, in Hounsfield units (HU), were adjusted and are indicated in the figures. Fig 8 presents representative denoised output slices of the Catphan phantom. For a clearer visual comparison between the resulting images, their close-up images are also displayed in Fig 9. Our proposed method optimizing $L_{wo}$ has the least noise among all algorithms, revealing smooth surfaces over piecewise-constant regions with the same density.
If we compare our method with $L_{wo}$ (Fig 9(e)) to the networks optimizing the objective loss in Fig 9(b) to 9(d), it has less noise in the denoised output, and the shapes of the objects are better preserved than in the others. However, textural details are lost because the outputs are blurry. For instance, the linearly aligned dots in Fig 9(e) cannot be distinguished from each other. In contrast, the networks optimizing the perceptual loss in Fig 9(f) to 9(h) preserve these shapes and edges better than the networks optimizing the objective loss in Fig 9(b) to 9(e). Our proposed algorithm with $L_{wp}$ (Fig 9(h)) preserves the detailed structures better than the other algorithms optimizing perceptual loss. For instance, Fig 9(h) exhibits clearer separation and more accurate placement of the vertically aligned points than the others; thus, our algorithm outputs more reliable and realistic CT results. Among the compared algorithms optimizing perceptual loss, the proposed algorithm with $L_{wp}$ had the highest PSNR value, as displayed in Table 6, and exhibited the best qualitative noise reduction performance. Therefore, the proposed algorithm optimizing $L_{wp}$ is effective at reducing noise and preserving information on the phantom datasets.
From the Mayo Clinic dataset, we selected CT slices that contain lesions and bone tissues. Representative CT slices that contain a lesion are depicted in Figs 10 and 11. The perception of lesions, which can be understood as their visibility, should not be degraded in the denoised CT images produced by the proposed algorithms. As expected, our network optimized with $L_{wo}$ removed the noise better than the other algorithms, but the shapes are smoothed, which weakens the contrast of the lesions in the region of interest. The networks optimizing perceptual loss (WGAN-VGG, CPCE3D, and our network optimized with $L_{wp}$) recovered the lost contrast and strengthened the visibility. Another CT slice that contains bone tissues is depicted in Figs 12 and 13. Bone tissues are a good indicator of sharpness and edges because they have a very subtle but complicated texture pattern. Among the networks that minimize objective loss, our network with $L_{wo}$ also has a more accurate texture pattern with relatively less loss of the trabecular microstructure than the others, which indicates that perceptual quality can be more easily enhanced from the proposed network. In addition, WGAN-VGG, CPCE3D, and the proposed network optimized with $L_{wp}$ exhibit perceptual quality comparable to the reference normal-dose CT image. They all have similar texture patterns, with only a slight loss of detail relative to the reference CT image.
Moreover, our proposed network with L wp has a higher objective quality with less noise than WGAN-VGG and CPCE3D. This result is significant because we achieved higher noise reduction performance than WGAN-VGG and CPCE3D even while maximizing the perceptual quality to a level comparable to theirs. Thus, our proposed network demonstrated a better PD trade-off than the current state-of-the-art methods.

Blind reader study with radiologists.
To conduct a blind reader study, we selected a representative group of 10 denoised CT slices from the LDCT denoising algorithms. Seven CT slices are from the Mayo Clinic dataset, and the remaining three CT slices are from phantom datasets. Reference normal-dose and low-dose images are included in each group, and we showed the images in random order to two radiologists with more than 10 years of experience in chest CT interpretation. They were asked to score each image on the following criteria: noise reduction, structural preservation, and overall quality. The scores ranged from 1 (unacceptable) to 5 (excellent), and the resulting scores for each algorithm are reported as the mean score of the two radiologists plus or minus the standard deviation (mean ± std) in Table 7.
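As a minimal sketch of how one such mean ± std entry could be computed, the Python snippet below pools the scores of two readers for a single algorithm and a single criterion. The scores are hypothetical placeholders, not values from Table 7. Pooling over readers and slices and using the sample standard deviation are both assumptions, since the text does not state which estimator was used.

```python
from statistics import mean, stdev

# Hypothetical 1-5 scores for one algorithm and one criterion,
# one entry per CT slice. These are NOT the values reported in Table 7.
reader_a = [4, 4, 5, 3, 4, 4, 5, 4, 3, 4]
reader_b = [4, 3, 4, 4, 4, 5, 4, 4, 4, 4]

def summarize(*readers):
    """Pool all readers' scores and return (mean, sample standard deviation)."""
    pooled = [score for scores in readers for score in scores]
    return mean(pooled), stdev(pooled)

m, s = summarize(reader_a, reader_b)
print(f"{m:.2f} ± {s:.2f}")
```

Replacing `stdev` with `statistics.pstdev` would give the population standard deviation instead, which is slightly smaller for a sample of this size.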
In general, algorithms optimizing objective loss (RED-CNN, WavResNet, U-Net and ours with L wo ) have excellent noise reduction performance, and algorithms pursuing perceptual loss (WGAN-VGG, CPCE3D, and ours with L wp ) received excellent scores in terms of structural preservation and overall quality. Our proposed network with L wo optimization achieved the best performance in noise reduction, and our network with L wp optimization achieved the best performance in structural preservation and overall quality. Interestingly, ours (L wo ) received a higher score in structural preservation and overall quality than WGAN-VGG [13], although ours (L wo ) optimized the objective quality whereas WGAN-VGG optimized the perceptual quality.

Perception-distortion trade-off curve.
To demonstrate that our proposed algorithms have a more effective PD trade-off, we implemented two networks, U-Net [19] and the proposed network, and optimized each network with four different loss functions: L wo , L wp , L o , and L p in (10), (19), (20) and (21), respectively. Their objective quality results are summarized in Table 8. Representative cropped CT slices for assessing the perceptual quality are depicted in Fig 14. In terms of objective quality, Table 8 reveals the following:
• Optimizing L wo achieved a higher PSNR than optimizing L o in both networks: 39.677 < 39.695 (U-Net) and 39.875 < 39.950 (ours).
• Optimizing L wp achieved a higher PSNR than optimizing L p in both networks: 37.717 < 39.013 (U-Net) and 38.856 < 39.110 (ours).
• The PD trade-off, measured as the PSNR decrease when switching from L wo to L wp , is smaller than the decrease when switching from L o to L p : 0.682 < 1.960 (U-Net) and 0.840 < 1.019 (ours).
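The gaps in the last bullet follow directly from the PSNR values quoted in the first two. The short Python sketch below recomputes them. The dictionary keys and the helper name `psnr_drop` are illustrative choices, not notation from the paper.

```python
# PSNR values quoted in the bullets above for Table 8.
psnr = {
    "U-Net": {"L_o": 39.677, "L_wo": 39.695, "L_p": 37.717, "L_wp": 39.013},
    "ours": {"L_o": 39.875, "L_wo": 39.950, "L_p": 38.856, "L_wp": 39.110},
}

def psnr_drop(values, objective_key, perceptual_key):
    """PSNR lost when moving from an objective loss to a perceptual loss."""
    return round(values[objective_key] - values[perceptual_key], 3)

for name, values in psnr.items():
    wavelet_gap = psnr_drop(values, "L_wo", "L_wp")  # wavelet-domain losses
    plain_gap = psnr_drop(values, "L_o", "L_p")      # L o and L p losses
    print(f"{name}: L_wo->L_wp gap {wavelet_gap:.3f}, L_o->L_p gap {plain_gap:.3f}")
```

For both networks the wavelet-domain gap is the smaller of the two, which is the sense in which the PD trade-off is described as improved.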

The CT images in the top row of Fig 14, which are the outputs of the networks optimizing the objective loss, exhibit stronger noise reduction, but the marked areas show overly smoothed shapes. In contrast, the networks optimizing the perceptual loss generated images with sharper edges and higher contrast, displayed in the bottom row of Fig 14. The proposed network optimizing L wp has a higher PSNR than the others in the bottom row; therefore, its denoised results have a lower noise level and lower distortion relative to the ground truth image. These observations indicate that the wavelet perceptual loss (L wp ) effectively improves the PD trade-off. Thus, we minimize the loss of objective quality while maximizing the perceptual quality.
Table 8. Trade-off of perception-distortion of U-Net and our proposed network. The metric is the peak signal-to-noise ratio (PSNR).
To provide an example of the PD trade-off using our blind study results, we chose U-Net, WGAN-VGG, and the proposed networks optimizing L wo and L wp and depicted the trade-off in Fig 15. Because we trained WGAN-VGG with U-Net, it can be regarded as the U-Net network optimized for perceptual quality. In the case of U-Net in Fig 15, raising the perceptual quality from 3.60 to 3.74 decreased the objective quality from 39.68 to 37.71. For our proposed algorithm, improving the perceptual quality from 3.81 to 4.09 reduced the objective quality from 39.95 to 39.11. Here, the value obtained by dividing the decrease in objective quality by the increase in perceptual quality can be interpreted as the cost in objective quality per unit increase in perceptual quality, and the resulting values for U-Net and our proposed algorithm are -14.0 and -3.0, respectively. Although the blind review scores are subjective, this measure demonstrates that the proposed methods with the wavelet domain loss have a better trade-off than the U-Net-based networks optimized with L o and L p .
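The ratio described above can be checked with a few lines of Python. The function name `trade_off` is an illustrative choice. With the two-decimal PSNR values quoted in the paragraph, the U-Net ratio comes out near -14.07. The reported -14.0 is reproduced when the three-decimal values 39.677 and 37.717 from the Table 8 discussion are used, under the assumption that these correspond to the same two networks.

```python
def trade_off(psnr_obj, psnr_perc, score_obj, score_perc):
    """Change in PSNR per unit increase in the blind-reader score."""
    return (psnr_perc - psnr_obj) / (score_perc - score_obj)

# PSNR and reader-score pairs as quoted in the paragraph above.
unet = trade_off(39.68, 37.71, 3.60, 3.74)  # U-Net -> WGAN-VGG
ours = trade_off(39.95, 39.11, 3.81, 4.09)  # ours (L wo) -> ours (L wp)

# Assumption: the three-decimal Table 8 PSNRs describe the same U-Net pair.
unet_3dp = trade_off(39.677, 37.717, 3.60, 3.74)

print(f"U-Net: {unet:.2f} (two-decimal PSNR), {unet_3dp:.2f} (three-decimal PSNR)")
print(f"ours:  {ours:.2f}")
```

A ratio closer to zero means less PSNR is given up for each point of perceptual score gained.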

Discussion
In this paper, we proposed a novel LDCT denoising method that generates denoised images with high objective and perceptual quality. Recent studies have focused more on perceptual quality in LDCT denoising; however, objective quality remains a key factor in measuring algorithm performance. Thus, our motivation was to keep the objective quality as high as possible while enhancing the perceptual quality. Our key contributions toward this goal are as follows. 1) We developed a network with the SWT that can achieve the highest objective quality among the state-of-the-art denoising algorithms for natural and LDCT images. 2) We presented a novel wavelet subband-specific learning strategy to preserve the structural and textural information in images while minimizing the loss of objective quality. As a result, we demonstrated a better PD trade-off with the proposed method using the wavelet domain loss. 3) We tested the proposed methods with a phantom dataset and the NIH-AAPM-Mayo Clinic Low Dose CT Grand Challenge dataset, demonstrating that they achieve better objective quality than other state-of-the-art LDCT denoising methods while preserving the perceptual quality.

The lack of proper metrics for measuring perceptual quality in an LDCT denoising task made it difficult to evaluate the perceptual quality of the algorithms assessed in this paper. To evaluate the perceptual quality, we invited experienced radiologists to conduct a blind reader study. However, conducting a blind reader study is very time-consuming and expensive, and the outcome could depend on the radiologists' experience [48]. As future work, we plan to develop a metric for perceptual image quality that correlates well with the characteristics of the human visual system in evaluating LDCT image quality. Such a metric would allow perceptual image quality to be evaluated at low cost and with little time investment.
As the network is trained using one-to-one mapping from low-dose to normal-dose CT images and normal-dose images are often not noise-free, our proposed networks might learn the residual noise of the target normal-dose CT images. In addition, as normal-dose CT images are set as the standard for measuring network performance, the networks cannot generate a denoised image that exceeds the quality of the normal-dose CT images. According to the overall quality reported in Table 7, WGAN-VGG and CPCE3D scored slightly lower than the normal-dose image, and only our algorithm scored higher than the normal-dose image, by a marginal difference. This limitation is very common in LDCT denoising algorithms, but it does not appear to be insurmountable. For instance, one could create a network capable of deriving an image superior to the original reference image by integrating unsupervised learning into the essentially supervised learning-based LDCT denoising problem [49].
Until recently, most LDCT denoising has focused on post-processing denoising due to the inability to access 2D projection data or proprietary reconstruction software. However, this post-processing method in the image domain has the disadvantage that it cannot effectively suppress noise or artifacts that have already been introduced in the process of reconstructing the projection images into 3D CT images with filtered back projection [50]. Recently, research on reconstructing 2D images into 3D images using neural networks has been published [51]. As further future work, we plan to optimize these reconstruction networks for the denoising task by incorporating 2D projection images to build an even better denoising model. Moreover, as volumetric CT images have 3D spatial information, we can employ spatial information in the out-of-plane directions to further enhance our denoising networks [15,17].
Last but not least, because pairs of low- and normal-dose CT images are difficult to obtain, research on unsupervised and self-learning-based denoising is being conducted more actively. CycleGAN [25], which translates noisy CT images into clean ones, was successfully applied to LDCT denoising [24,52]. Self-learning-based models using only noisy images have also been proposed for LDCT denoising tasks. Several studies [22,23,53–55] combined a self-learning strategy with a CT reconstruction pipeline or a physics-based noise model. Noise2Context [56] and Noise2Neighbors [57] effectively suppressed noise with a physics-based CT model. Although these unsupervised and self-learning models successfully reduced noise, they still lag well behind the state-of-the-art supervised LDCT denoising models [3,5,13,15,19]. Furthermore, as they have focused mainly on noise reduction, models that enhance perceptual quality [13,58], or our work, can be combined with these unsupervised or self-learning LDCT denoising models to obtain denoised CT images with both excellent objective and perceptual quality.

Conclusion
In conclusion, the studied networks optimizing the objective loss exhibited excellent performance in suppressing noise at the cost of losing detailed textures and edges that are important for clinical diagnosis. In contrast, the networks optimizing the perceptual loss generated realistic CT images with high perceptual quality but with relatively high noise. With the key insight that high- and low-frequency components in an image have different characteristics, we proposed a novel network capable of achieving high objective and perceptual quality using the presented frequency subband-specific loss in the wavelet domain. Our proposed methods demonstrate an effective PD trade-off in LDCT denoising. With phantom and clinical datasets, our proposed methods produce accurate and realistic CT images and achieve better performance than the existing state-of-the-art methods in terms of objective and perceptual quality.